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Abstract 

* We introduce Bayesian optimization, a technique developed for optimizing time-consuming engi¬ 
neering simulations and for fitting machine learning models on large datasets. Bayesian optimization 
guides the choice of experiments during materials design and discovery to find good material designs in 
as few experiments as possible. We focus on the case when materials designs are parameterized by a 
low-dimensional vector. Bayesian optimization is built on a statistical technique called Gaussian process 
regression, which allows predicting the performance of a new design based on previously tested designs. 
After providing a detailed introduction to Gaussian process regression, we introduce two Bayesian op¬ 
timization methods: expected improvement, for design problems with noise-free evaluations; and the 
knowledge-gradient method, which generalizes expected improvement and may be used in design prob¬ 
lems with noisy evaluations. Both methods are derived using a value-of-informat ion analysis, and enjoy 
one-step Bayes-optimality. 


1 Introduction 

In materials design and discovery, we face the problem of choosing the chemical structure, composition, 
or processing conditions of a material to meet design criteria. The traditional approach is to use iterative 
trial and error, in which we (1) choose some material design that we think will work well based on 
intuition, past experience, or theoretical knowledge; (2) synthesize and test the material in physical 
experiments; and (3) use what we learn from these experiments in choosing the material design to try 
next. This iterative process is repeated until some combination of success and exhaustion is achieved. 

While trial and error has been extremely successful, we believe that mathematics and computation 
together promise to accelerate the pace of materials discovery, not by changing the fundamental iterative 
nature of materials design, but by improving the choices that we make about which material designs to 
test, and by improving our ability to learn from previous experimental results. 

In this chapter, we describe a collection of mathematical techniques, based on Bayesian statistics 
and decision theory, for augmenting and enhancing the trial and error process. We focus on one class 
of techniques, called Bayesian optimization (BO), or Bayesian global optimization (BGO), which use 
machine learning to build a predictive model of the underlying relationship between the design parameters 
of a material and its properties, and then use decision theory to suggest which design or designs would 
be most valuable to try next. The most well-developed Bayesian optimization methods assume that 
(1) the material is described by a vector of continuous variables, as is the case, e.g., when choosing 
ratios of constituent compounds, or choosing a combination of temperature and pressure to use during 
manufacture; (2) we have a single measure of quality that we wish to make as large as possible; and (3) 
the constraints on feasible materials designs are all , so that any unknown constraints are incorporated 
into the quality measure. There is also a smaller body of work on problems that go beyond these 
assumptions, either by considering discrete design decision (such as small molecule design), multiple 
competing objectives, or by explicitly allowing unknown constraints. 

Bayesian optimization was pioneered by |33| . with early development through the 1970s and 1980s by 
Mockus and Zilinskas |37[ I36 |. Development in the 1990s was marked by the popularization of Bayesian 
optimization by Jones, Schonlau, and Welch, who, building on previous work by Mockus, introduced the 
Efficient Global Optimization (EGO) method [28]. This method became quite popular and well-known 
in engineering, where it has been adopted for design applications involving time-consuming computer 


1 


experiments, within a broader set of methods designed for optimization of expensive functions [3- In 
the 2000s, development of Bayesian optimization continued in statistics and engineering, and the 2010s 
have seen additional development from the machine learning community, where Bayesian optimization 
is used for tuning hyperparameters of computationally expensive machine learning models m- Other 
introductions to Bayesian optimization may be found in the tutorial article [6] and textbooks [9l|44], and 
an overview of the history of the field may be found in |45| . 

We begin in Section by introducing the precise problem considered by Bayesian Optimization. 
We then describe in Section the predictive technique used by Bayesian Optimization, which is called 
Gaussian Process (GP) regression. We then show, in Section]^ how Bayesian Optimization recommends 
which experiments to perform. In Section we provide an overview of software packages, both freely 
available and commercial, that implement the Bayesian Optimization methods described in this chapter. 
We offer closing remarks in Section 


2 Bayesian Optimization 

Bayesian optimization considers materials designs parameterized by a d-dimensional vector x. We suppose 
that the space of materials designs in which x takes values is a known set ^4 C R'*. 

For example, x = (*(1),..., x{d)) could give the ratio of each of d different constituents mixed together 
to create some aggregate material. In this case, we would choose A to be the set A = {x : x{i) = 1}. 

As another example, setting d = 2, x = (x(l),x(2)) could give the temperature (x(l)) and pressure 
(2;(2)) used in material processing. In this case, we would choose A to be the rectangle bounded by 
the experimental setup’s minimum and maximal achievable temperature on one axis, Tmin and Tmax, 
and the minimum and maximum achievable pressure on the other. As a final example, we could let 
X = (a;(l),..., x{d) be the temperatures used in some annealing schedule, assumed to be decreasing over 
time. In this case, we would set A to be the set {x : Tmax > a:(l) > • • • > x{d) > Tmin}- 

Let f{x) be the quality of the material with design parameter x. The function / is unknown, and 
observing f{x) requires synthesizing material design x and observing its quality in a physical experiment. 
We would like to find a design x for which f{x) is large. That is, we would like to solve 

max/(a:). (1) 

x^A 

This is challenging because evaluating f{x) is typically expensive and time-consuming. While the time 
and expense depends on the setting, synthesizing and testing a new material design could easily take 
days or weeks of effort and thousands of dollars of materials. 

In Bayesian optimization, we use mathematics to build a predictive model for the function / based on 
observations of previous materials designs, and then use this predictive model to recommend a materials 
design that would be most valuable to test next. We first describe this predictive model in Section 
which is performed using a machine learning technique called Gaussian process regression. We then 
describe, in Section]^ how this predictive model is used to recommend which design to test next. 


3 Gaussian Process regression 


The predictive piece of Bayesian optimization is based on a machine learning technique called Gaussian 
process regression. This technique is a Bayesian version of a frequentist technique called kriging, intro¬ 
duced in the geostatistics literature by South-African mining engineer Daniel Krige [30], and popularized 
later by Matheron and colleagues [35], as described in [8]. A modern monograph on Gaussian process 
regression is HD, and a list of software implementing Gaussian process regression may be found at |40 |. 

In Gaussian process regression, we seek to predict f{x) based on observations at previously evaluated 
points, call them xi,..., x„. We first treat the case where f{x) can be observed exactly, without noise, 
and then later treat noise in Section 3.5 In this noise-free case, our observations are yt = f{xi) for 
i = 1,..., n. 


Gaussian process regression is a Bayesian statistical method, and in Bayesian statistics we perform 
inference by placing a so-called prior probability distribution on unknown quantities of interest. The 
prior probability distribution is often called, more simply, the prior distribution or, even more simply, 
the prior. This prior distribution is meant to encode our intuition or domain expertise regarding which 
values for the unknown quantity of interest are most likely. We then use Bayes rule, together with 
any data observed, to calculate a posterior probability distribution on these unknowns. For a broader 
introduction to Bayesian statistics, see the textbook [H] or the research monograph [1]. 
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In Gaussian process regression, if we wish to predict the value of / at a single candidate point x* , it 
is sufficient to consider our unknowns to be the values of / at the previously evaluated points, Xi,..., Xn, 
and the new point x* at which we wish to predict. That is, we take our unknown quantity of interest to be 
the vector {f{xi),..., f{xn), f{x*)). We then take our data, which is f{xi ),..., f{xn), and use Bayes rule 
to calculate a posterior probability distribution on the full vector of interest, {f{xi),..., f{xn), fix*)), 
or, more simply, just on fix*). 

To calculate the posterior, we must first specify the prior, which Gaussian process regression assumes 
to be multivariate normal. It calculates the mean vector of this multivariate normal prior distribution 
using a function, called the mean function and written here as ^o(-)i which takes a single x as an 
argument. It applies this mean function to each of the points xi,... ,Xn,x* to create an n+ 1-dimensional 
column vector. Gaussian process regression creates the covariance matrix of the multivariate normal prior 
distribution using another function, called the covariance function or covariance kernel and written here 
as Eo(-, •), which takes a pair of points x,x' as arguments. It applies this covariance function to every 
pair of points in xi,..., x„, x to create an (n -|- 1) x (n -|- 1) matrix. 

Thus, Gaussian process regression sets the prior probability distribution to. 


'fixiY 


/ 

lio(xi) 


■Eo(xi,xi) ■ 

Eo(xi,x„) 

Eo(xi,x*)' 

\ 

■ 

_ 1 

~ Normal 

1 

1^0 (Xn ) 
_/io(x*)_ 


Eo(x„,xi) • 
_Eo(x*,xi) • 

Eq iXji, Xji) 
Eo(x*,x„) 

Eo(x„,x*) 
Eo(x*, x*)_ 
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The subscript “0” in /lo and Eq indicate that these functions are relevant to the prior distribution, before 
any data has been collected. 

We now discuss how the mean and covariance functions are chosen, focusing on the covariance function 
first because it tends to be more important in getting good results from Gaussian process regression. 


3.1 Choice of covariance function 

In choosing the covariance function Eo(-, ■), we wish to satisfy two requirements. 

The first is that it should encode the belief that points x and x' near each other tend to have more 
similar values for /(x) and fix'). To accomplish this, we want the covariance matrix in ([^ to have 
entries that are larger for pairs of points that are closer together, and closer to 0 for pairs of points that 
are further apart. 

The second is that the covariance function should always produce positive semidefinite covariance 
matrices in the multivariate normal prior. That is, if E is the covariance matrix in then we require 
that a^Ea > 0 for all column vectors a (where a is assumed to have the appropriate length, n -|- 1). 
This requirement is necessary to ensure that the multivariate normal prior distribution is a well-defined 
probability distribution, because if 9 is multivariate normal with mean vector fi and covariance matrix 
E, then the variance of a • 6 is a^Ea, and we require variances to be non-negative. 

Several covariance functions satisfy these two requirements. The most commonly used is called the 
squared exponential, or Gaussian kernel, and is given by. 


Eo(x, x') = a exp 



(3) 


This kernel is parameterized by d -I- 1 parameters: a, and Pi, ..., Pd. 

The parameter a > 0 controls how much overall variability there is in the function /. We observe 
that under the prior, the variance of /(x) is Var(/(x)) = Cov(/(x),/(x)) = a. Thus, when a is large, 
we are encoding in our prior distribution that /(x) is likely to take a larger range of values. 

The parameters Pi > 0 controls how quickly the function / varies with x. For example, consider the 
relationship between some point x and another point x' = x -f [1, 0,..., 0]. When Pi is small (close to 0), 
the covariance between /(x) and fix') is aexp(—/3i) « a, giving a correlation between /(x) and fix') 
of nearly 1. This reflects a belief that /(x) and fix') are likely to be very similar, and that learning the 
value of fix) will also teach us a great deal about fix'). In contrast, when Pi is large, the covariance 
between /(x) and fix') is nearly 0, given a correlation between /(x) and fix') that is also nearly 0, 
reflecting a belief that /(x) and fix') are unrelated to each other, and learning something about /(x) 
will teach us little about (x'). 
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Going beyond the squared exponential kernel There are several other possibilities for the 
covariance kernel beyond the squared exponential kernel, which encode different assumptions about the 
underlying behavior of the function /. One particularly useful generalization of the squared exponen¬ 
tial covariance kernel is the Matern covariance kernel, which allows more flexibility in modeling the 
smoothness of /. 

Before describing this kernel, let r = be the Euclidean distance between x and x', but 

where we have altered the length scale in each dimension by some strictly positive parameter Pi. Then, 
the squared exponential covariance kernel can be written as, Tjo{x,x') = a exp (—r^). 

With this notation, the Matern covariance kernel is, 


T,o{x,x') 


r{u) 






where Kjj is the modified Bessel function. If we take the limit as iz —^ oo, we obtain the squared 
exponential kernel Section 4.2 page 85). 

The Matern covariance kernel is useful because it allows modeling the smoothness of / in a more flex¬ 
ible way, as compared with the squared exponential kernel. Under the squared exponential covariance 
kernel, the function / is infinitely mean-square differentiabl^ which may not be an appropriate assump¬ 
tion in many applications. In contrast, under the Matern covariance kernel, / is fc-times mean-square 
differentiable if and only if iz > fc. Thus, we can model a function that is twice differentiable but no more 
by choosing iz = 5/2, and a function that is once differentiable but no more by choosing v = 3/2. 

While the squared exponential and Matern covariance kernels allow modeling a wide range of be¬ 
haviors, and together represent a toolkit that will handle a wide variety of applications, there are other 
covariance kernels. For a thorough discussion of these, see Chapter 4 of ED- 

Both the Matern and squared exponential covariance kernel require choosing parameters. While it 
certainly is possible for one to choose the parameters a and Pi (and v in the case of Matern) based on 
one’s intuition about /, and what kinds of variability / is likely to have in a particular application, it is 
more common to choose these parameters (especially a and Pi) adaptively, so as to best fit previously 
observed points. We discuss this more below in Section [3.6| First, however, we discuss the choice of the 
mean function. 


3.2 Choice of mean function 

We now discuss choosing the mean function /ro(-). Perhaps the most common choice is to simply set the 
mean function equal to a constant, fi. This constant must be estimated, along with parameters of the 
covariance kernel such as a and Pi, and is discussed in Section [3.6| 

Beyond this simple choice, if one believes that there will be trends in / that can be described in a 
parametric way, then it is useful to include trend terms into the mean function. This is accomplished by 
choosing 

.] 

where are known functions, and 7 j G R, along with /r G R, are parameters that must be estimated. 

A common choice for the Tj , if one chooses to include them, are polynomials in x up to some small 
order. For example, if d = 2, so x is two-dimensional, then one might include all polynomials up to 
second order, 4'i(x) = xi, 4>2(x) = X 2 , 4'3(x) = (xi)^, 4'4(x) = (x 2 )^, 4'5(x) = xiX 2 , setting J = 5. One 
recovers the constant mean function by setting J — 0. 

3.3 Inference 

Given the prior distribution § on /(xi),..., /(x„), /(x*), and given (noise-free) observations of/(xi),..., 
f{Xn), the critical step in Gaussian process regression is calculating the posterior distribution on f{x*). 
We rely on the following general result about conditional probabilities and multivariate normal distribu¬ 
tions. Its proof, which may be found in Section relies on Bayes rule and algebraic manipulation of the 
probability density of the multivariate normal distribution. 


^Being “mean-square differentiable” at x in the direction given by the unit vector e; means that the limit lim^_>o(/(x -|- 
~ exists in mean square. Being “fc-times mean-square differentiable” is defined analogously. 
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Proposition 1. Let 6 he a k-dimensional multivariate normal random column vector, with mean vector 
fi and covariance matrix E. Let fci > 1,^2 > 1 be two integers summing to k. Decompose 9, fi and E as 


^[1] 



9i2] 

, /i — 



^[1,1] ^[1.2] 
_S[2,1] ^[2,2] 


SO that and are ki-column vectors, and E[i^j] is a kt x kj matrix, for each i,j = 1,2. 

//Ei,i and E 2,2 are invertible, then, for any u £ the conditional distribution of 6[ 2 ] given that 
0[i] = u is multivariate normal with mean 


and covariance matrix 


M[2] + S[2,i]E[j^j^] (w — 

S[2,2] - S[2,l]Ejj^yE[i_2]. 


We use this proposition to calculate the posterior distribution on f{x*), given f{xi),..., f(xn)- 
Before doing so, however, we first introduce some additional notation. We let y\-n indicate the column 
vector [yi,... ,y„Y, and we let indicate the sequence of vectors {xi,... ,x„). We let f{xi:„) = 
[f{xi ),..., f{x„)]'^, and similarly for other functions of x, such as We introduce similar additional 

notation for functions that take pairs of points x,x', so that T.{xi-.n., xi-.n.) is the matrix 

-So{xi,xi) ■■■ y:q(xi,x„)- 

• ' . . ? 

.S]o(a:n,a;i) 12Q(xn,Xn). 


Eo(x*, 2;i;n) is the row vector [Eo(a;*, ii),..., Eo(a;*, 2;n)], and Eo(xi;„,a;*) is the column vector 
[Eo(a:i, X*), Eo(a:n, . 

This notation allows us to rewrite § as 


yi-.n 

f{x*)_ 


= Normal 


I' hoixi-.n) 

U . 


Eq (^1 :n , ^1 :n ) Eo(xi;n,^ ) 

Eo(a:*,a:i:„) Eo(a;*,a;*) 


(4) 


We now examine this expression in the context of Proposition [T| We set Spj = /( 2 :i;„), 9^2] = f{x*), 
M[l] — l^o(^l:n), M[2] — IXqI^X ), Ep — Eq (2^1 :n, ^1 :n), Ep2] — Eq \X2:n , X ), Epp| — Eq ,Xl;n), aud 
S[ 2 , 2 ] = Eo(a:*,a;*). 

Then, applying Proposition we see that the posterior distribution on f{x*) given observations 
yi = f{xi), i = 1,..., n is normal, with a mean pL„{x*) and variance a^{x*) given by, 


p„(a:*) =/ro(®*) + Eo(a;*, xi;„)Eo(xi;n, ^{f {xi,„) - fJ-oixi-.u)), (5) 

a-'i{x*) = Eo(a;*,a;*) - Eo(x*,Xi:„)Eo(xi:„,Xi:„)“^Eo(xi:„,x*). (6) 


The invertibility of Eo(xi;„, Xi;„) (and also Eo(x*,x*)) required by Proposition depends on the 
covariance kernel and its parameters (typically called hyperparameters), but this invertibility typically 
holds as long as these hyperparameters satisfy mild non-degeneracy conditions, and the are distinct, 
i.e., that we have not measured the same point more than once. For example, under the squared 
exponential covariance kernel, invertibility holds as long as a > 0 and the X\-n are distinct. If we have 
measured a point multiple times, then we can safely drop all but one of the measurements, here where 
observations are noise-free. Below, we treat the case where observations are noisy, and in this case 
including multiple measurements of the same point is perfectly reasonable and does not cause issues. 

Figure shows the output from Gaussian process regression. In the figure, circles show points 
{xi, f{xi)), the solid line shows fin{x*) as a function of x*, and the dashed lines are positioned at 
fj,„{x*) ± 1 . 96 ct „( x *), forming a 95% Bayesian credible interval for f{x*), i.e., an interval in which f{x*) 
lies with posterior probability 95%. (A credible interval is the Bayesian version of a frequentist confidence 
interval.) Because observations are noise-free, the posterior mean /r„(x*) interpolates the observations 

fix*)- 
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Figure 1: Illustration of Gaussian process regression with noise-free evaluations. The circles show previously 
evaluated points, {xi, f{xi)). The solid line shows the posterior mean, finix) as a function of x, which 
is an estimate f{x), and the dashed lines show a Bayesian credible interval for each f{x), calculated as 
Unix) ± 1.96cr„(a;). Although this example shows / taking a scalar input, Gaussian process regression can 
be used for functions with vector inputs. 


3.4 Inference with just one observation 

The expressions § and 0 are complex, and perhaps initially difficult to assimilate. To give more 
intuition about them, and also to support some additional analysis below in Section it is useful to 
consider the simplest case, when we have just a single measurement, n — 1. 

In this case, all matrices in 0 and 0 are scalars, So(a:*,Xi) = Eo(2;i,a;*), and the expressions 0 
and 0 can be rewritten as, 


a^ix*) 


fioix*) + ^rj^r^^ifixi) - fJ-oixi)), 


T,o{x*,x*) 


T,o{x*,xi)^ 

Eo(a;i,xi) 


(7) 

( 8 ) 


Intuition about the expression for the posterior mean We first examine 0. We see that 
the posterior mean of fix*), fii{x*), which we can think of as our estimate of fix*) after observing 
fixi), is obtained by taking our original estimate of fix*), fioix*), and adding to it a correction term. 
This correction term is itself the product of two quantities: the error fixi) — fioixi) in our original 
estimate of fixi), and the quantity . Typically, Eo(a;*,a:i) will be positive, and hence also 

Sol^i’xij • (hficall, Eo(xi, xi) is a variance, so is never negative.) Thus, if /(xi) is bigger than expected, 
/(xi) — po(xi) will be positive, and our posterior mean pi(x*) will be larger than our prior mean po(x*). 
In contrast, if /(xi) is smaller than expected, /(xi) — po(xi) will be negative, and our posterior mean 
fii{x*) will be smaller than our prior mean po(x*). 

We can examine the quantity understand the effect of the position of x* relative to xi on 

the magnitude of the correction to the posterior mean. Notice that x* only enters this expression through 
the numerator. If x* is close to xi, then Eo(x*, xi) will be large under the squared exponential and most 
other covariance kernels, and positive values for /(xi) — /ro(xi) will also cause a strong positive change 
in pi(x*) relative to po(x*). If x* is far from xi, then Eo(x*,xi) will be close to 0, and /(xi) — po(xi) 
will have little effect on pi(x*). 


Intuition about the expression for the posterior variance Now we examine ([^. We see 

that the variance of our belief on fix*) under the posterior, ai{x*), is smaller than its value under the 
prior, Eo(x*, X*). Moreover, when x* is close to xi, Eo(x*, xi) will be large, and the reduction in variance 
from prior to posterior will also be large. 

Conversely, when x* is far from xi, Eo(x*,xi) will be close to 0, and the variance under the posterior 
will be similar to its value under the prior. 
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As a final remark, we can also rewrite the expression ([^ in terms of the squared correlation under 
the prior, Corr(/(a:*),/(xi))^ = Eo(a;*, xi)/(Eo(a;*, 2 ;*)Eo(a;i, * 1 )) € [0,1], as 

0 - 1 ( 2 :*) = Eo(a;*,a;*) (l - Corr(/(a;*),/(xi))^) . 

We thus see that the reduction in variance of the posterior distribution depends on the squared correlation 
under the prior, with larger squared correlation implying a larger reduction. 


3.5 Inference with noisy observations 

The previous section assumed that f{x*) can be observed exactly, without any error. When f{x*) is the 
outcome of a physical experiment, however, our observations are obscured by noise. Indeed, if we were 
to synthesize and test the same material design x* multiple times, we might observe different results. 

To model this situation, Gaussian process regression can be extended to allow observations of the 
form, 

Vixi) = f{xi) + ei, 

where we assume that the ei are normally distributed with mean 0 and constant variance, A^, with 
independence across i. In general, the variance A^ is unknown, but we treat it as a known parameter of 
our model, and then estimate it along with all the other parameters of our model, as discussed below in 
Section 13.61 

These assumptions of constant variance (called homoscedasticity) and independence make the analysis 
significantly easier, although they are often violated in practice. Experimental conditions that tend to 
violate these assumptions are discussed below, as are versions of GP regression that can be used when 
they are violated. 


Analysis of independent homoscedastic noise To performance inference under independent 
homoscedastic noise, and calculate a posterior distribution on the value of the function /(a;*) at a given 
point x», our first step is to write down the joint distribution of our observations and the 

quantity we wish to predict, /(a;*), under the prior. That is, we write down the distribution of the vector 

[yi, ■ ■ • ,l/n,/(a:.)]. 

We first observe that [i/i,..., j/„, /(x*)] is the sum of [/(xi),..., f{xn), /(x,)] and another vector, 
[ei,..., e„, 0]. The first vector has a multivariate normal distribution given by The second vector 
is independent of the first and is also multivariate normal, with a mean vector that is identically 0, 
and a covariance matrix diag(A^,..., A^, 0). The sum of two independent multivariate normal random 
vectors is itself multivariate normal, with a mean vector and covariance matrix given, respectively, by 
the sums of the mean vectors and covariance matrices of the summands. This gives the distribution of 
[yi, ■ ■ • ,l/n,/(x.)] as 


yi-.n 

~ Normal ( 

A‘o(xi;n) 


fix*) 

V 

Mo(x*) 



Eo(xi;n5 Xl;n) “t“ A In Eo(xi;, 
Eo(x*,Xl:„) Eo(x* 


(9) 


where In is the n-dimensional identity matrix. 

As we did in Section|3(^ we can use Proposition[^with the above expression to compute the posterior 
on /(x*) given /(xi;„). We obtain, 

(x ) — /ro(x ) “t“ Eq (x ,Xl;n) ^Elo (xiiTi, XliTi) T A (yi:n /ao(Xl:n)) (19) 

(^nix*) = Eo(x’,x*) - Eo(x*,Xl:„) [Eq (xi,„,Xl:„) + A^7„] ^ Eo(Xl;„,X*). (11) 

If we set A^ =0, so there is no noise, then we recover |5f and §. 

Figure shows an example of a posterior distribution calculated with Gaussian process regression 
with noisy observations. Notice that the posterior mean no longer interpolates the observations, and the 
credible interval has a strictly positive width at points where we have measured. Noise prevents us from 
observing function values exactly, and so we remain uncertain about the function value at points we have 
measured. 


Going beyond homoscedastic independent noise Gonstant variance is violated if the exper¬ 
imental noise differs across materials designs, which occurs most frequently when noise arises during the 
synthesis of the material itself, rather than during the evaluation of a material that has already been cre¬ 
ated. Some work has been done to extend Gaussian process regression to flexibly model heteroscedastic 
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Figure 2: Illustration of Gaussian process regression with noisy evaluations. As in Figure]^ the circles show 
previously evaluated points, where yi is f{xi) perturbed by constant-variance independent noise. 

The solid line shows the posterior mean, yn{x) as a function of a;, which is an estimate of the underlying 
function /, and the dashed lines show a Bayesian credible interval for /, calculated as /i„(a;) ± 1.96cr„(a:). 


noise (i.e., noise whose variance changes) |2411311 111 I52 | . The main idea in much of this work is to use 
a second Gaussian process to model the changing variance across the input domain. Much of this work 
assumes that the noise is independent and Gaussian, though considers non-Gaussian noise. 

Independence is most typically violated, in the context of physical experiments, when the synthesis 
and evaluation of multiple materials designs is done together, and the variation in some shared component 
simultaneously influences these designs, e.g., through variation in the temperature while the designs are 
annealing together, or through variation in the quality of some constituent used in synthesis. We are 
aware of relatively little work modeling dependent noise in the context of Gaussian process regression 
and Bayesian optimization, with one exception being EH- 

3.6 Parameter Estimation 

The mean and covariance functions contain several parameters. For example, if we use the squared 
exponential kernel, a constant mean function, and observations have independent homoscedastic noise, 
then we must choose or estimate the parameters y, a, j5i,..., Pd, A. These parameters are typically called 
hyperparameters because they are parameters of the prior distribution. (A^ is actually a parameter of 
the likelihood function, but it is convenient to treat it together with the parameters of the prior.) While 
one may simply choose these hyperparameters directly, based on intuition about the problem, a more 
common approach is to choose them adaptively, based on data. 

To accomplish this, we write down an expression for the probability of the observed data yi-.n in 
terms of the hyperparameters, marginalizing over the uncertainty on f{xi-.n.)- Then, we optimize this 
expression over the hyperparameters to hnd settings that make the observed data as likely as possible. 
This approach to setting hyperparameters is often called empirical Bayes, and it can be seen as an 
approximation to full Bayesian inference. 

We detail this approach for the squared exponential kernel with a constant mean function. Estimating 
for other kernels and mean functions is similar. Using the probability distribution of yi.n from ([^, and 
neglecting constants, the natural logarithm of this probability, logp(i/i;„ | xi-.n) (called the “log marginal 
likelihood”), can be calculated as 

y) (Eo(Xl;,i,Xl;,i) “t“ A {yi:n y') ^ |Eo(3^1;n,^l:n) “t” A In \ , 

where | • | applied to a matrix indicates the determinant. 

To find the hyperparameters that maximize this log marginal likelihood (the neglected constant 
does not affect the location of the maximizer), we will take partial derivatives with respect to each 
hyperparameter. We will then use them to find maximizers of y and cr^ := a -|- A^ analytically, and then 
use gradient-based optimization to maximize the other hyperparameters. 
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Taking a partial derivative with respect to fj., setting it to zero, and solving for /r, we get that the 
value of /i that maximizes the marginal likelihood is 


((S0(X1:„, Xl:n) + A" Jn ) " . 


Define R as the matrix with components 


5 exp - ; 


i=j, 

i 7^ j, 


where g = -^ 


Then Eo( 2 :i;„,a;i;„) + X?In = a^R and g, can be written in terms of i? as p = 
The log marginal likelihood (still neglecting constants) becomes 


logp( 2 /l:n I a:i:n) 




Taking the partial derivative with respect to and noting that (i does not depend on , we solve for 
and obtain 

0-2 = -(yun - A)-R"^(yi:n “ A)- 
n 

Substituting this estimate, the log marginal likelihood becomes 


logp(yi:n I 2;i:„) ~ -log ( (yi:„ - A) R (yi:n “ A) 


( 12 ) 


The expression (121 cannot in general be optimized analytically. Instead, one typically optimizes it 


numerically using a first- or second-order optimization algorithm, such as Newton’s method or gradient 
descent, obtaining estimates for Pi,..., fid and g. These estimates are in turn substituted to provide an 
estimate of R, from which estimates A a^nd cr^ may be computed. Finally, using am;! tjje estimated 
value of (?, we may estimate a and A. 


3.7 Diagnostics 

When using Gaussian process regression, or any other machine learning technique, it is advisable to check 
the quality of the predictions, and to assess whether the assumptions made by the method are met. One 
way to do this is illustrated by Figure which comes from a simulation of blood flow near the heart, 
based on |33], for which we get exact (not noisy) observations of fix) 

This plot is created with a technique called leave-one-out cross validation. In this technique, we iterate 
through the datapoints xi-.n, yi-n, and for each i € {1,... ,n}, we train a Gaussian process regression 
model on all of the data except Xi,yi, and then use it, together with Xi, to predict what the value yi 
should be. We obtain from this a posterior mean (the prediction), call it fi-i{xi), and also a posterior 
standard deviation, call it a-i(xi). When calculating these estimates, it is best to separately re-estimate 
the hyperparameters each time, leaving out the data {xi,yi). We then calculate a 95% credible interval 
fi-i{xi) ±2a-i{xi), and create Figure by plotting “Predicted” vs. “Actual”, where the “Actual” 
coordinate (on the x-axis) is yi, and the “Predicted” value (on the y-axis) is pictured as an error bar 
centered at p.-i{xi) with half-width 2a-i{xi). 

If the uncertainty estimates outputted by Gaussian process regression are behaving as anticipated, 
then 95% of the credible intervals will intersect the diagonal line Predicted^Actual. Moreover, if Gaussian 
process regression’s predictive accuracy is high, then the credible intervals will be short, and their centers 
will be close to this same line Predicted=Actual. 

This idea may be extended to noisy function evaluations, under the assumption of independent 
homoscedastic noise. To handle the fact that the same point may be sampled multiple times, let m{x) 
be the number of times that a point x € {xi,... ,x„} was sampled, and let y(x) be the average of the 
observed values at this point. Moreover, by holding out all m(x) samples of x and training Gaussian 
process regression, we would obtain a normal posterior distribution on f{xi) that has mean fi-i{xi) and 
standard deviation <j-i{xi). 
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Figure 3: Diagnostic plot for Gaussian process regression, created with leave-one-out cross validation. For 
each point in our dataset, we hold that point {xi,yi) out, train on the remaining points, calculate a 95% 
credible interval for j/i, and plot this confidence interval as an error bar whose x-coordinate is the actual 
value yi. If Gaussian process regression is working well, 95% of the error bars will intersect the diagonal line 
Predicted=Actual. 


Since y{xi) is then the sum of f(xi) and some normally distributed noise with mean 0 and vari¬ 
ance /m{xi), the resulting distribution of j7(a:i) is normal with mean y,-i{xi) and standard deviation 
^ 0-2 .(a;;) + /m{xi). 

From this, a 95% credible interval for y{xi) is then y,-i{xi) ± 2y^cr^^(a;i) -|- X^/m{xi). We would 
plot Predicted vs. Observed by putting this credible interval along the y-axis at x-coordinate yixi). If 
Gaussian process regression is working well, then 95% of these credible intervals will intersect the line 
Predicted=Observed. 

For Gaussian process regression to best support Bayesian optimization, it is typically most important 
to have good uncertainty estimates, and relatively less important to have high predictive accuracy. 
This is because Bayesian optimization uses Gaussian process regression as a guide for deciding where 
to sample, and so if Gaussian process regression reports that there is a great deal of uncertainty at 
a particular location and thus low predictive accuracy, Bayesian optimization can choose to sample at 
this location to improve accuracy. Thus, Bayesian optimization has a recourse for dealing with low 
predictive accuracy, as long as the uncertainty is accurately reported. In contrast, if Gaussian process 
regression estimates poor performance at a location that actually has near-optimal performance, and 
also provides an inappropriately low error estimate, then Bayesian optimization may not sample there 
within a reasonable timeframe, and thus may never correct the error. 

If either the uncertainty is incorrectly estimated, or the predictive accuracy is unsatisfactorily low, 
then the most common “fixes” employed are to adopt a different covariance kernel, or to transform the 
objective function /. If the objective function is known to be non-negative, then the transformations 
log(/) and are convenient for optimization because they are both strictly increasing, and so do not 
change the set of maximizers (or minimizers). If / is not non-negative, but is bounded below by some 
other known quantity a, then one may first shift / upward by a. 


3.8 Predicting at more than one point 


Below, to support the development of the knowledge-gradient method in Sections |4.2| and |8.3[ it will be 
useful to predict the value of / at multiple points, *(,..., a;J, with noise. To do so, we could certainly 
apply (101 and (111 separately for each x\,... ,xl., and this would provide us with both an estimate (the 
posterior mean) and a measure of the size of the error in this estimate (the posterior variance) associated 
with each f{x*). It would not, however, quantify the relationship between the errors at several different 
locations. For this, we must perform the estimation jointly. 
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As we did in Section 3.5 we begin with our prior on [yi:n, which is, 


yi.n 


Normal 


/ro(a:i:n) 


^0 “t“ A In ^0 ) 

^0 5 ^0 (^1; A; 5 ^l;fc ) 


We then use Propositionto compute the posterior on given /(xim), which is multivariate 

normal with mean vector and covariance matrix T,n{x\.^,x\.^) given by, 


We see that setting k 


/^o(^l:fc) “t“ ^0 (^1: A;, ^l;n ) [^0 (^1 :n , ^1 :n ) “t" A In^ {^Pl'.n /ro(^l:n)), 

^0 (^1: A , ^1: A ) ^0 (^1: A i ) ^^0 (^l:n , ) “t“ A /n] ^0 (^l:n , ^1; A ) • 


= 1 provides the expressions JlOk and (111 from Section 3.5 


(13) 

(14) 


3.9 Avoiding matrix inversion 

The expressions ( |10| | and ( |11[ ) for the posterior mean and variance in the noisy case, and also 0 and ([^ 
in the noise-free case, include a matrix inversion term. Calculating this matrix inversion is slow and can 
be hard to accomplish accurately in practice, due to the finite precision of floating point implementations. 
Accuracy is especially an issue when E has terms that are close to 0, which arises when data points are 
close together. 

In practice, rather than calculating a matrix inverse directly, it is typically faster and more accurate 
to use a mathematically equivalent algorithm, which performs a Cholesky decomposition and then solves 
a linear system. This algorithm is described below, and is adapted from Algorithm 2.1 in Section 2.3 of 
m- This algorithm also computes the log marginal likelihood required for estimating hyperparameters 
in Section lT6l 


Algorithm 1 Implementation using Cholesky decomposition 

Require: xi:n (inputs), yi.,n (responses), T,q{x,x') (covariance function), (variance of noise), x* (test 
input). 

1; L = Cholesky (So(a;i:„, xun) -b A^/„) 

2: S = (L\ {yi,n - yoixi-.n))) 

3 : ynix*) = yo{x*) +T.oix*,Xi:n)6 
4: V = L\T,o{xi.,n,X*) 

5 : cr^(x*) = Eo(a;*,a;*) — v'^v 

6: \ogp{yi-.n I Xi,n) = (j/Un “ Po{xi-.n))'^ a - log Lji - f log 27r 

7 ; return y,n{x*) (mean), cr^{x*) (variance), logp(yi:„ | Xi.n) (log marginal likelihood). 


4 Choosing where to sample 

Being able to infer the value of the objective function f{x) at unevaluated points based on past data 
x-i-n,yi:n is Only One part of finding good designs. The other part is using this ability to make good 
decisions about where to direct future sampling. 

Bayesian optimization methods addresses this by using a measure of the value of the information that 
would be gained by sampling at a point. Bayesian optimization methods then choose the point to sample 
next for which this value is largest. A number of different ways of measuring the value of information 
have been proposed. Here, we describe two in detail, expected improvement [Ml HE], and the knowledge 
gradient |15l |46| , and then survey a broader collection of design criteria. 


4.1 Expected Improvement 

Expected improvement, as it was first proposed, considered only the case where measurements are free 
from noise. In this setting, suppose we have taken n measurements at locations x-i.n and observed yi-.n- 
Then 

fn = , inax f{xi) 

1 = 1, ...,n 
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is the best value observed so far. Suppose we are considering evaluating / at a new point x. After this 
evaluation, the best value observed will be 

f^+i = max(/(a;),/*), 


and the difference between these values, which is the improvement due to sampling, is 

/n+i - fl= ma.x{f{x) - /*, 0) = (fix) - /*) + , 

where a+ = max(a, 0) indicates the positive part function. 

Ideally, we would choose x to make this improvement as large as possible. Before actually evaluating 
f{x), however, we do not know what this improvement will be, so we cannot implement this strategy. 
However, we do have a probability distribution on f{x), from Gaussian process regression. The expected 
improvement, indicated EI(a:), is obtained by taking the expectation of this improvement with respect 
to the posterior distribution on f{x) given xi-,n,yi:n- 

m{x) = E^[{f{x)-f:)+], (15) 


where En\ ■] = £'[■ \x\-.n,yi:n\ indicates the expectation with respect to the posterior distribution. 

The expectation in ( |15[ ) can be computed more explicitly, in terms of the normal cumulative dis- 
tributio n fu nction (cdf) $(•), and the normal probability density function (pdf) Recalling from 

Section 3.3 that fix) ~ Normal(/r„(a;),CT^(a:)), where pnix) and a^ix) are given by ([^ and and 


integrating with respect to the normal distribution (a derivation may be found in Section]^, we obtain. 


Ei(.). (,„(.) - /a* ■ («) 

Figurej^plots this expected improvement for a problem with a one-dimensional input space. We can see 
from this plot that the expected improvement is largest at locations where both the posterior mean /in(x) 
is large, and also the posterior standard deviation a" is large. This is reasonable because those points 
that are most likely to provide large gains are those points that have a high predicted value, but that 
also have significant uncertainty. Indeed, at points where we have already observed, and thus have no 
uncertainty, the expected improvement is 0. This is consistent with the idea that, in a problem without 
noise, there is no value to repeating an evaluation that has already been performed. 

This idea of favoring points that, on the one hand, have a large predicted value, but, on the other 
hand, have a significant amount of uncertainty, is called the exploration vs. exploitation tradeoff, and 
appears in areas beyond Bayesian optimization, especially in reinforcement learning |29l I49 | and multi¬ 
armed bandit problems [231134] . In these problems, we are taking actions repeatedly over time whose 
payoffs are uncertain, and wish to simultaneously get good immediate rewards, while learning the reward 
distributions for all actions to allow us to get better rewards in the future. We emphasize, however, 
that the correct balance between exploration and exploitation is different in Bayesian optimization as 
compared with multi-armed bandits, and should more favor exploration: in optimization, the advantage 
of measuring where the predicted value is high is that these areas tend to give more useful information 
about where the optimum lies; in contrast, in problems where we must “learn while doing” like multi¬ 
armed bandits, evaluating an action with high predicted reward is good primarily because it tends to 
give a high immediate reward. 

We can also see the exploration vs. exploitation tradeoff implicit in the expected improvement function 
in the contour plot. Figure]^ This plot shows the contours of EI(a:) as a function of the posterior mean, 
expressed as a difference from the previous best, A„(a:) := yn.{x) — ff), and the posterior standard 
deviation cr„(a;). 

Given the expression ( |16[ l, Bayesian optimization algorithms based on expected improvement, such 
as the Efficient Global Optimization (EGO) algorithm proposed by |28| . and the earlier algorithms of 
Mockus (see, e.g., the monograph [36]), then recommend sampling at the point with the largest expected 
improvement. That is, 

Xn+i € argmaxEI(a;). (17) 

X 

Finding the point with largest expected improvement is itself a global optimization problem, like 
the original problem that we wished to solve 0 - Unlike 0 . however, EI(a;) can be computed quickly, 
and its first and second derivatives can also be computed quickly. Thus, we can expect to be able to 
solve Q relatively well using an off-the-shelf optimization method for continuous global optimization. 
A common approach is to use a local solver for continuous optimization, such as gradient ascent, in a 
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Figure 4: Upper panel shows the posterior distribution in a problem with no noise and a one-dimensional 
input space, where the circles are previously measured points, the solid line is the posterior mean and 

the dashed lines are at /r„(a;) ± 2cr„(a;). Lower panel shows the expected improvement EI(a;) computed from 
this posterior distribution. An “x” is marked at the point with the largest expected improvement, which is 
where we would evaluate next. 


C 

< 


Figure 5: Contour plot of the expected improvement, as a function of the difference in means A„(a;) := 
IJ-nix) — f* and the posterior standard deviation (Tn(x). The expected improvement is larger when the 
difference in means is larger, and when the standard deviation is larger. 
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multistart framework, where we start the local solver from many starting points chosen at random, and 
then select the best local solution discovered. In Section we describe several codes that implement 
expected improvement methods, and each makes its own choice about how to solve ( |17[ ). 

The algorithm given by ( |17| l is optimal under three assumptions: (1) that we will take only a single 
sample; (2) there is no noise in our samples; and (3) that the x we will report as our final solution (i.e., 
the one that we will implement) must be among those previously sampled. 

In practice, assumption (1) is violated, as Bayesian optimization methods like (171 are applied itera¬ 
tively, and is made simply because it simplifies the analysis. Being able to handle violations of assumption 
(1) in a more principled way is of great interest to researchers working on Bayesian optimization method¬ 
ology, and some partial progress in that direction is discussed in Section 4.3 Assumption (2) is also often 
violated in a broad class of applications, especially those involving physical experiments or stochastic 
simulations. In the next section, we present an algorithm, the knowledge-gradient algorithm [nms], 
that relaxes this assumption (2), and also allows relaxing assumption (3) if this is desired. 


4.2 Knowledge Gradient 


When we have noise in our samples, the derivation of expected improvement meets with difficulty. In 
particular, if we have noise, then = maxi=i,.,.,n /(xi) is not precisely known, preventing us from using 


the expression (16l. 


One may simply take a quantity like maxi=i,.,.,„ yi that is similar in spirit to = maxi=i,,.,,„ f{xi), 


and replace fn in (161 with this quantity, but the resulting algorithm is no longer justified by an optimality 
analysis. Indeed, for problems with a great deal of noise, maxi^i,.,.,„ tends to be significantly larger 
than the true underlying value of the best point previously sampled, and so the resulting algorithm may 
be led to make a poor tradeoff between exploration and exploitation, and exhibit poor performance in 
such situations. 

Instead, the knowledge-gradient algorithm |151 I46| takes a more principled approach, and starts 
where the derivation of expected improvement began, but fully accounts for the introduction of noise 
(assumption 2 in section [tT| , and the possibility that we wish to search over a class of solutions broader 
than just those that have been previously evaluated when recommending the final solution (assumption 


3 in section 4.11. 


We first introduce a set A„, which is the set of points from which we would choose final solution, if 
we were asked to recommend a final solution at time n, based on Xi-.n, yi-.n- For tractability, we suppose 
An is finite. For example, if A is finite, as it often is in discrete optimization via simulation problems, 
we could take A„ = A, allowing the whole space of feasible solutions. This choice was considered in 
|15| . Alternatively, one could take An = {xi, ..., Xn}, stating that one is willing to consider only those 
points that have been previously evaluated. This choice is consistent with the expected improvement 
algorithm. Indeed, we will see that when one makes this choice, and measurements are free from noise, 
then the knowledge-gradient algorithm is identical to the expected improvement algorithm. Thus, the 
knowledge-gradient algorithm generalizes the expected improvement algorithm. 

If we were to stop sampling at time n, then the expected value of a point x € An based on the 
information available would be En[f{x)] = pLn(x). In the special case when evaluations are free from 
noise, this is equal to fix), but when there is noise, these two quantities may differ. If we needed to 
report a final solution, we would then choose the point in A„ for which this quantity is the largest, i.e., 
we would choose argmax„,g^^ fJ,n{x). Moreover, the expected value of this solution would be 


Pn = max pnix). 

X^An 


If evaluations are free from noise and An = {xi.n}, then p* is equal to /*, but in general these quantities 
may differ. 

If we take one additional sample, then the expected value of the solution we would report based on 
this additional information is 

Pn + l = max Pn+lix), 

where as before, A„+i is some finite set of points we would be willing to consider when choosing a final 
solution. Observe in this expression that pn+i{x) is not necessarily the same as /r„(x), even for points 
X € {xi.n} that we had previously evaluated, but that pn+iix) can be computed from the history of 
observations x\,n+i, Vi-.n+i- 

The improvement in our expected solution value is then the difference between these two quantities, 
Pn+i — fjin- This improvement is random at time n, even fixing Xn+i, through its dependence on yn+i, 
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Figure 6: Upper panel shows the posterior distribution in a problem with independent normal homoscedastic 
noise and a one-dimensional input space, where the circles are previously measured points, the solid line is the 
posterior mean /i„(a;), and the dashed lines are at /J.„(a;) ± 2cr„(a;). Lower panel shows the natural logarithm 
of the knowledge-gradient factor KG(a;) computed from this posterior distribution, where An = An+i are 
the discrete grid 300}. An “x” is marked at the point with the largest KG factor, which is where the 

KG algorithm would evaluate next. 


but we can take its expectation. The resulting quantity is called the knowledge-gradient (KG) factor, 
and is written, 

KG„(a;) = £;„ [^*+1 -/r* I Xn+i = a:]. (18) 

Calculating this expectation is more involved than calculating the expected improvement, but nev¬ 
ertheless can also be done analytically in terms of the normal pdf and normal cdf. This is described in 
more detail in Section 18.31 

The knowledge-gradient algorithm is then the one that chooses the point to sample next that maxi¬ 
mizes the KG factor, 

argmax KG„ (x). 

X 

The KG factor for a one-dimensional optimization problem with noise is pictured in Figure We 
see a similar tradeoff between exploration and exploitation, where the KG factor favors measuring points 
with a large Unix) and a large cr„(x). We also see local minima of the KG factor at points where we 
previously evaluated, just as with the expected improvement, but because there is noise in our samples, 
the value at these points is not 0 — indeed, when there is noise, it may be useful to sample repeatedly 
at a point. 

Choice of A„ and An+i Recall that the KG factor depends on the choice of the sets An and An+i, 
through the dependence of gr) and on these sets. Typically, if we choose these sets to contain more 
elements, then we allow /r* and to range over a larger portion of the space, and we allow the KG 

factor calculation to more accurately approximate the value that would result if we allowed ourself to 
implement the best option. However, as we increase the size of these sets, computing the KG factor is 
slower, making implementation of the KG method more computationally intensive. 

For applications with a finite A, US] proposed setting An+i = An = A, which was seen to require 
fewer function evaluations to find points with large /, in comparison with expected improvement on 
noise-free problems, and in comparison with another Bayesian optimization method, sequential kriging 
optimization (SKO) [27] on noisy problems. However, the computation and memory required grows 
rapidly with the size of A, and is typically not feasible when A contains more than 10,000 points. 
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For large-scale applications, [i^ proposed setting An+i = = {xi-„+i} in (181, and called the 

resulting quantity the approximate knowledge gradient (AKG), observing that this choice maintained 
computational tractability as A grows, but also offers good performance. This algorithm is implemented 
in the DiceKriging package | 42| . 

Finally, in noise-free problems (but not in problems with noise), setting j4„+i = {x\-n+i} and An = 
{xi:n} recovers expected improvement. 


4.3 Going beyond one-step analyses, and other methods 

Both expected improvement and the knowledge-gradient method are designed to be optimal, in the special 
case where we will take just one more function evaluation and then choose a final solution. They are not, 
however, known to be optimal for the more general case in which we will take multiple measurements, 
which is the way they are used in practice. 

The optimal algorithm for this more general setting is understood to be the solution to a partially 
observable Markov decision process, but actually computing the optimal solution using this understanding 
is intractable using current methods |12| . Some work has been done toward the goal of developing such an 
optimal algorithm [22], but computing the optimal algorithm remains out of reach. Optimal strategies 
have been computed for other closely related problems in optimization of expensive noisy functions, 
including stochastic root-finding [51], multiple comparisons with a standard |53| . and small instances of 
discrete noisy optimization with normally distributed noise (also called “ranking and selection”) |13j . 

Expected improvement and the knowledge gradient are both special cases of the more general concept 
of value of information, or expected value of sample information (EVSI) [25], as they calculate the 
expected reward of a final implementation decision as a function of the posterior distribution resulting 
from some information, subtract from this the expected reward that would result from not having the 
information, and then take the expectation of this difference with respect to the information itself. 

Many other Bayesian optimization methods have been proposed. A few of these methods optimize the 
value of information, but are calculated using different assumptions than those used to derive expected 
improvement or value of information. A larger number of these methods optimize quantities that do not 
correspond to a value of information, but are derived using analyses that are similar in spirit. These 
include methods that optimize the probability of improvement | 33[ 1481 the entropy of the posterior 
distribution on the location of the maximum |5n| . and other composite measures involving the mean and 
the standard deviation of the posterior Ezj. 

Other Bayesian optimization methods are designed for problem settings that do not match the assump¬ 
tions made in this tutorial. These include [2l 1141 [3^ . which consider multiple objectives; [20l|2ll[7]|47], 
which consider multiple simultaneous function evaluations; [1611261 fTO] . which consider objective func¬ 
tions that can be evaluated with multiple fidelities and costs; [^, which considers Bernoulli outcomes, 
rather than normally distributed ones; [18], which considers expensive-to-evaluate inequality constraints; 
and [38] . which considers optimization over the space of small molecules. 


5 Software 

There are a number of excellent software packages, both freely available and commercial, that implement 
the methods described in this chapter, and other similar methods. 

• Metrics Optimization Engine (MOE), an open-source code in C-I--I- and Python, developed by the 
authors and engineers at Yelp. http://yelp.github.io/MOE/ 

• Spearmint, an open-source code in Python, implementing algorithms described in [d^- https: 
//github.com/JasperSnoek/spearmint 

• DiceKriging and DiceOptim, an open-source R package that implements expected improvement, 
the approximate knowledge-gradient method, and a variety of algorithms for parallel evaluations. 
An overview is provided in |42| . 

http://cran.r-project.org/web/packages/DiceOptim/index.html 

• TOMLAB, a commercial package for MATLAB. http://tomopt.com/tomlab/ 

• matlabKG, an open-source research code that implements the discrete knowledge-gradient method 
for small-scale problems. 

http://people.orie.Cornell.edu/pfrazier/src.html 

A list of software packages focused on Gaussian process regression (but not Bayesian optimization) 
may be found at http://www.gaussianprocess.org/ 
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6 Conclusion 


We have presented Bayesian optimization, including Gaussian process regression, the expected improve¬ 
ment method, and the knowledge-gradient method. In making this presentation, we wish to emphasize 
that this approach to materials design acknowledges the inherent uncertainty in statistical prediction 
and seeks to guide experimentation in a way that is robust to this uncertainty. It is inherently iterative, 
and does not seek to circumvent the fundamental trial-and-error process. 

This is in contrast with another approach to informatics in materials design, which holds the hope 
that predictive methods can short-circuit the iterative loop entirely. In this alternative view of the world, 
one hopes to create extremely accurate prediction techniques, either through physically-motivated ab 
initio calculations, or using data-driven machine learning approaches, that are so accurate that one can 
rely on the predictions alone rather than on physical experiments. If this can be achieved, then we can 
search over materials designs in silica, hnd those designs that are predicted to perform best, and test 
those designs alone in physical experiments. 

For this approach to be successful, one must have extremely accurate predictions, which limits its 
applicability to settings where this is possible. We argue that, in contrast, predictive techniques can 
be extremely powerful even if they are not perfectly accurate, as long as they are used in a way that 
acknowledges inaccuracy, builds in robustness, and reduces this inaccuracy through an iterative dialog 
with physical reality mediated by physical experiments. Moreover, we argue that mathematical tech¬ 
niques like Bayesian optimization, Bayesian experimental design, and optimal learning provide us the 
mathematical framework for accomplishing this goal in a principled manner, and for using our power to 
predict as effectively as possible. 
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8 Derivations and Proofs 

This section contains derivations and proofs of equations and theoretical results found in the main text. 


8.1 Proof of Proposition 

Proof. Using Bayes’ rule, the conditional probability density of Spj at a point up] given that 0[i] = up] 


p{0[2] = M[ 2 ] I ^[1] = M[ 1 ]) = - _ - ocp(6l[i, = U[i,,6lp, = up,) 


p(6i[i] = U[1]) 


oc exp I 



“[11 - m 

T 

E[i,i] E[ip] 

-1 

M[l] - M[ll 

V 2 

W[2] - P[2]_ 


_S[2,1] ^[2,2]_ 


M[2] - M[2]. 


(19) 


To deal with the inverse matrix in this expression, we use the following identity for inverting a block 

A BI 

where both A and D are invertible square matrices, is 


matrix: the inverse of the block matrix 


C D 


A 

B 

-1 

{A - BD-^C)-^ -{A - BD-^C)-^BD-^' 

c 

D 


-{D-CA-^B)-^CA-^ {D-CA-^B)-^ 


Applying (20 1 to (191, and using a bit of algebraic manipulation to get rid of constants, we have 
p{9[2] = W[ 2 ] I ^'[ 1 ] = M[ 1 ]) oc exp , 


( 20 ) 


( 21 ) 


where = /rp] — Epp]S(;^jj(u[i] — M[i]) and E"®" = E[ 2 p] — Ep,i]E(;^jjE[ip]. 

We see that (21 1 is simply the unnormalized probability density function of a normal distribution. 
Thus the conditional distribution of ^p] given 6 ^, 1 ] = u,i] is multivariate normal, with mean and 

covariance matrix E“®". □ 
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8.2 Derivation of Equation (16) 

Since f{x) ~ Normal(/i„(a:), cr^(a:)), the probability density of f{x) is p{f{x) = z) = exp ((2 — /j.„(x))^/2a„(x)^). 
We use this to calculate EI(a;): 


EI(x) = E4(f(x) -/:)+] 


-1: 

-I 


iz - a 


1 


V^cr„ (x) 


e dz 


/• V^cr„(®) 

/ OO 

{Pn{x) + {Z - p„{x))) 
1 


-(z-fxn.(x))^ 

e dz-/:{!-$ 


fn - Pnjx) 

a„(x) 


e dz-/:{!-$ 


f 


{z - Pn{x)) 


V^(T„ {x) 


V^a„{x) 


e dz + {p„{x) - f*) { 1 - $ 


fn /an(^) 
On{x) 

fn - Pnix) 
an{x) 


1 -IfX-nnM)^ 

^ + (Mn(2;) -/*)(!-$ 


= it^n{x) -/*)!-$ 


fn pn{xf) 
On{x) 


+ (Jn{x)ip 


fn Pni^) 
On{x) 

fn pn (a?) 

cr„(a;) 


f / \ f i^^{x) - f*\ . . f l^nix) - f*\ 

= ((..(X) - /,.)«■ ( j + x»(x)»" ( „„(,) j ■ 

8.3 Calculation of the KG factor 


The KG factor (18 1 is calculated by first considering how the quantity — /r* depends on the infor¬ 
mation that we have at time n, and the additional datapoint that we will obtain, yn+i- 

First observe that Pn+i — Pn is & deterministic function of the vector [pn+iix) : x £ An+i] and other 
quantities that are known at time n. Then, by applying the analysis in Section |3.5| but letting the 
posterior given xi-.n, yi-.n play the role of the prior, we obtain the following version of ( | 10 | l, which applies 
to any given x, 

yn + l{x) = Pn{x) + + (y„ + l - PniXn + l)) ■ (22) 


In this expression, y.n{-) and E„(-, •) are given by (131 and ( |14[ ). 

We see from this expression that y.n+iix) is a linear function of j/n+i, with an intercept and a slope 
that can be computed based on what we know after the nth measurement. 

We will calculate the distribution of j/n+i, given what we have observed at time n. First, /(x„+i)|a;i;n, j/i,, 
Normal (/r„(a;„+i),E„(a;„+i,a;„+i)). Since j/„+i = f{xn+i) +€„+!, where £n+i is independent with dis¬ 
tribution £n-i-i ~ Normal(0, A^), we have 

J/n-|-l|a;i,„, J/l:n ~ Normal [y,n{Xn+l),'EniXn + l,Xn+l) -I- A^) . 


Plugging the distribution of yn+i into (22 1 and doing some algebra, we have 

y,n+i{x)\xi:n,yi:n ~ Normal (y,n(x),d'^{x,Xn +l)), 
where d(x,Xn+i) = Moreover, we can write y,n+i(x) as 

yin+l{x) = Unix) + '5ix,Xn + l)Z, 

where Z = (t/n+i — fJ.n{xn+i))/\/'Enix„+i,Xn+i) + is a standard normal random variable, given xi,, 
and yi-.n- 

Now (fTsl becomes 


KGnix) =En 


max -I-CT(x',Xn+i)^ I x„+i = X 

:'eA„+i 


f-^n' 


Thus, computing the KG factor comes down to being able to compute the expectation of the maximum 
of a collection of linear functions of a scalar normal random variable. Algorithm 2 of [15] , with software 
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provided as part of the matlabKG library m, computes the quantity 


h{a, 6) = E 


max (ai + biZ) 


max ai 


for arbitrary equal-length vectors a and b. Using this ability, and letting be the vector : 

x' € and r7(A„+i, x) be the vector [a(x',x) : x' G A„+i], we can write the KG factor as 


KG„(a:) = h{fj,„{A„+i),a{A„+i,x)) + [max(/i„(^„+i)) - /r*]. 


If = An, as it is in the versions of the knowledge-gradient method proposed in | 15l I46| . then the 

last term max(/r„(A„+i)) — is equal to 0 and vanishes. 
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